Embedding Project 1
Project Goal
- This project aims to create a tool for searching podcast transcripts to easily revisit specific segments of episodes from a YouTube playlist.
- For instance, one of the podcasts I follow has 215 episodes, and manually searching for relevant sections in old episodes is time-consuming. This tool automates the process by enabling semantic searches across transcripts, returning precise video segments with timestamps. While the concept can apply to any video or audio with transcripts, this implementation focuses on YouTube playlist videos.
- Here is what the output of the program looks like:
» uv run main.py --name pt --search "the word cat only means something because it isn't the word cow"
Searching for: "the word cat only means something because it isn't the word cow" in collection: pt
1. Episode #115 ... Structuralism and Context. Confidence score = 0.64401466.
- https://www.youtube.com/watch?v=CZSrvFKGuC8&start=923&end=954
2. Episode #117 ... Structuralism and Mythology pt. 2. Confidence score = 0.6229924.
- https://www.youtube.com/watch?v=adlc1yY47xE&start=1193&end=1223
3. Episode #216 ... The Self-Overcoming of Nihilism - Kyoto School pt. 1 - Nishitani. Confidence score = 0.5024827.
- https://www.youtube.com/watch?v=eD_gi25iBIE&start=1494&end=1524
Tools Used
- Vector Database: Qdrant, for storage and retrieval of transcript embeddings.
- Text Embedding Model: SentenceTransformer (all-MiniLM-L6-v2), for generating semantic embeddings of transcript text.
- YouTube Video Downloader: yt-dlp, for downloading transcripts without video files.
- Subtitles Processing Library: webvtt, for parsing subtitle files (VTT format) with timestamps.
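These tools fit together around vector similarity: the embedding model turns each transcript chunk (and each search query) into a fixed-length vector, and Qdrant ranks chunks by cosine similarity to the query vector. A toy sketch of that ranking step with hand-made 3-dimensional vectors (the real all-MiniLM-L6-v2 model produces 384-dimensional vectors; the vectors and texts below are purely illustrative):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hand-made "embeddings": semantically similar texts get nearby vectors.
chunks = {
    "the word cat only means something in contrast to other words": [0.9, 0.1, 0.2],
    "today we unbox a new mechanical keyboard": [0.1, 0.8, 0.3],
}
query_vector = [0.85, 0.15, 0.25]  # pretend embedding of the search query

# Rank chunks by similarity to the query, highest first.
ranked = sorted(chunks, key=lambda t: cosine_similarity(query_vector, chunks[t]), reverse=True)
print(ranked[0])
```

In the real pipeline Qdrant performs this ranking internally; the point is only that "relevance" is a geometric comparison between vectors, not keyword matching.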
Process Overview
- The workflow involves downloading transcripts, processing them into searchable chunks, and enabling semantic searches over them via a vector database.
Step 1: Downloading Transcripts
- Tool: yt-dlp
- Process:
- Use yt-dlp to download transcripts from a YouTube playlist without downloading video files.
- Key yt-dlp options:
- --skip-download: Downloads only transcripts, not video files.
- --ignore-errors: Skips private or unavailable videos.
- --sub-format vtt: Downloads subtitles in WebVTT (.vtt) format, YouTube's default subtitle format (other formats such as SRT are also supported).
- The script supports multiple playlists, each identified by a unique collection name (i.e., the playlist name). If the same collection name is reused, previous data (folders, files, and database entries) is deleted to start fresh.
input_path = f"collections/{collection_name}/subtitles"  # where to store the downloaded subtitles
if os.path.isdir(input_path):
    print("Deleting existing directory")
    shutil.rmtree(input_path)

ydl_opts = {
    'skip_download': True,      # transcripts only, no video files
    'subtitlesformat': 'vtt',
    'subtitleslangs': ['en'],
    'writeautomaticsub': True,  # fall back to auto-generated captions
    'writesubtitles': True,
    'ignoreerrors': True,       # skip private/unavailable videos
    'paths': {
        'home': input_path
    }
}
try:
    # Create a YoutubeDL instance and download only the subtitles
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        error_code = ydl.download([load_url])
        print(f"yt_dlp error code {error_code}")
except yt_dlp.utils.DownloadError as e:
    print(f"Download failed: {e}")
Step 2: Processing Transcripts
- Library: webvtt
- Process:
- Read .vtt subtitle files using webvtt to extract transcript text and timestamps.
- Address duplicate subtitle entries (a common issue in YouTube transcripts) by storing unique lines as dictionary keys; because Python dicts preserve insertion order, this effectively simulates an ordered set, which Python lacks as a built-in data structure.
- Transform transcripts into searchable chunks:
- Chunking: Split transcripts into 30-second segments (configurable duration, chosen experimentally).
- Extracted Information:
- Transcript with Timestamps: Each chunk includes text and start/end timestamps.
- Video ID: Used to construct YouTube URLs with start/end times for direct access to the relevant segment.
- File Name: Acts as a grouping key to limit search results to one match per video, avoiding multiple results from the same episode.
print("Finished downloading subtitles, starting chunk process")
segment_point_in_second = 30  # chunk subtitles into 30-second windows
chunks = process_subtitle_file(input_path, segment_point_in_second)
print(f"Finished creating chunks for: {len(chunks)} files, next storing to db")
store_subtitle_data(
chunks=chunks,
collection_name=collection_name,
client=qdrant_client,
encoder=encoder
)
print("Finished storing.")
def process_subtitle_file(input_path, segment_time_second: int):
    chunks = defaultdict(list)
    for file_name in os.listdir(input_path):
        file_path = os.path.join(input_path, file_name)
        start_time = 0
        text = {}  # dict keys act as an ordered set for deduplication
        file_name_clean = clean_file_name(file_name)
        captions = webvtt.read(file_path)
        for caption in captions:
            end_time = time_to_seconds(caption.end)
            for t in caption.text.strip().split("\n"):
                text[t] = None
            # Close a chunk every segment_time_second seconds, or at the last caption
            if end_time - start_time >= segment_time_second or caption == captions[-1]:
                chunks[file_name_clean].append({
                    "file_name": file_name_clean,
                    "start": start_time,
                    "end": end_time,
                    "text": " ".join(text.keys()),
                    "video_id": get_youtube_id(file_name)
                })
                text = {}
                start_time = end_time
    return chunks
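The time_to_seconds helper used above is not shown. WebVTT timestamps look like 00:15:23.500, so a plausible stdlib-only implementation (hypothetical; the real helper may differ) is:

```python
def time_to_seconds(timestamp: str) -> int:
    """Convert a WebVTT 'HH:MM:SS.mmm' timestamp to whole seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + int(float(seconds))

print(time_to_seconds("00:15:23.500"))  # → 923
```

Truncating to whole seconds is enough here, since the values only feed YouTube's start/end URL parameters, which take integer seconds.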
Step 3: Searching
- Process:
- Accept a user query
- Search the Qdrant database to find the most relevant transcript chunks based on semantic similarity.
- Return formatted results, including:
- The matching transcript text.
- A YouTube URL with timestamps to jump directly to the relevant video segment.
- The video title or filename for context.
print(f"Searching for: '{search_query}' in collection: '{collection_name}'")
hits = initiate_rag_search(
    query=args.search,
    collection_name=collection_name,
    client=qdrant_client,
    encoder=encoder
)
counter = 1
for point_group in hits.groups:
    for hit in point_group.hits:
        metas = hit.payload
        file_name = metas["file_name"]
        start_time = metas["start"]
        end_time = metas["end"]
        video_id = metas["video_id"]
        print(f"{counter}. {file_name}. Confidence score = {hit.score}.")
        print(f"\t - https://www.youtube.com/watch?v={video_id}&start={start_time}&end={end_time}")
        counter += 1